Prognostic risk models for incident hypertension: A PRISMA systematic review and meta-analysis

Objective Our goal was to review the available literature on prognostic risk prediction for incident hypertension, synthesize performance, and provide suggestions for future work on the topic. Methods A systematic search of the PubMed and Web of Science databases was conducted for studies on prognostic risk prediction models for incident hypertension in generally healthy individuals. Study quality was assessed using the Prediction model Risk of Bias Assessment Tool (PROBAST) checklist. Three-level meta-analyses were used to obtain pooled AUC/C-statistic estimates. Heterogeneity was explored using study and cohort characteristics in meta-regressions. Results From 5090 hits, we found 53 eligible studies and included 47 in meta-analyses. Only four studies were assessed to have results with low risk of bias. Few models had been externally validated, with only the Framingham risk model validated more than three times. The pooled AUC/C-statistics were 0.82 (0.77–0.86) for machine learning (ML) models and 0.78 (0.76–0.80) for traditional models, with high heterogeneity in both groups (I² > 99%). Intra-class correlations within studies were 60% and 90%, respectively. Follow-up time (P = 0.0405) was significant for ML models and age (P = 0.0271) for traditional models in explaining heterogeneity. Validations of the Framingham risk model showed high heterogeneity (I² > 99%). Conclusion Overall, the quality of included studies was assessed as poor. AUC/C-statistics were mostly acceptable or good, and higher for ML models than for traditional models. High heterogeneity implies large variability in the performance of new risk models. Further, the large heterogeneity in validations of the Framingham risk model indicates variability in model performance on new populations. To enable researchers to assess hypertension risk models, we encourage adherence to existing guidelines for reporting and developing risk models, specifically reporting appropriate performance measures. Further, we recommend a stronger focus on validation of models by considering reasonable baseline models and performing external validations of existing models. Hence, developed risk models must be made available to external researchers.


Introduction
Hypertension is considered the number one preventable risk factor for cardiovascular disease (CVD) and all-cause death globally [1]. The number of individuals suffering from hypertension effectively doubled in the period 1990-2019, to an estimated 1.4 billion [1,2]. Although global mean blood pressure (BP) has remained stable or even decreased slightly over the past decades due to more effective tools for managing BP, the prevalence of hypertension has increased, especially in low- to middle-income countries [3]. Yet upwards of 50% of individuals participating in structured screening programs were not aware of their elevated blood pressure, regardless of the income level of their country [2,4]. One explanation for this unawareness is the predominantly asymptomatic nature of hypertension, highlighting the importance of individuals paying attention to their BP [2].
Despite the existence of effective prevention strategies in the form of lifestyle management and drug treatment, the prevalence of hypertension has steadily increased [2,3]. The potential of these strategies has motivated research into identifying at-risk individuals at an earlier stage, including the development of prognostic risk models. Risk models can be seen as one way of moving towards personalized medicine, as risks can be estimated for an individual based on their unique set of clinical predictors [5].
Many risk models for hypertension in the general population have been developed in recent years. Two earlier reviews provided a narrative synthesis of the risk models available at the time, which were mostly developed using traditional regression-based methods [6,7]. Recognizing the popularization and availability of machine learning (ML) methods, a third review [8] expanded on the two prior reviews by including ML models. ML and traditional models were analyzed separately in meta-analyses, where high heterogeneity was noted in both cases. A quality assessment of the literature indicated generally low risk of bias (ROB) for traditional models, while no assessment was made of machine learning models [8].
In all three reviews, little distinction was made between diagnostic risk models and prognostic risk models. Although the modelling of diagnostic and prognostic models is similar, and the two even share the same guidelines for appropriate development, they differ in their aim and intended clinical use. A diagnostic model only extends to estimating the risk of existing disease, whereas prognostic models provide risk estimates over a prediction horizon. This implies a difference in the clinical use case between the two model types. Prognostic models may serve as an auxiliary tool for clinical practitioners: by providing risk estimates of future incidence, they may allow for early intervention and personalized long-term health planning with the goal of preventing incidence [5].
Considering this, we restricted our focus to prognostic risk models for the general population. Using meta-analyses, we synthesize the available evidence to quantitatively assess model performance.

Inclusion and exclusion criteria
Records were eligible if they fulfilled the following criteria:
• Utilized data from a prospective or retrospective cohort,
• The population at baseline consisted of normotensive adults drawn from the general population,
• The primary goal was the development of a model or tool for risk estimation,
• The outcome was prognostic risk of incident primary hypertension as a binary trait,
• The models were evaluated on a dataset and performance measures were reported,
• Written in English.
Exclusion criteria were:
• Simulation studies,
• Unpublished research studies,
• Studies concerned with any form of secondary hypertension,
• Studies concerned with any other hypertensive diseases, e.g., gestational, ocular, intra-cranial, pulmonary, isolated systolic, isolated diastolic,
• Association studies or studies where the impact of one or a few similar predictors was the primary focus,
• Reviews of the literature.

Selection of studies
After removing duplicates, articles were screened by title and abstract for relevancy. Articles selected by their title and abstract were then assessed by their full text for eligibility. Works citing the included articles, as well as their reference lists, were searched for potentially eligible articles not found during the database search. Primary reasons for exclusion after the full-text read were documented following the criteria. We used citation tracking of included records to assess whether the clinical impact of any risk models had been validated in subsequent publications.

Data extraction
For each included study, we collected Population, Intervention, Comparison, Outcome, Timeframe, and Study design (PICOTS) items and information related to bias assessment and meta-analyses.
We also assessed model availability, i.e., whether the developed models could be readily adopted by external researchers or others.A developed model was deemed available if described or presented in text or figure to a sufficient degree for application, provided as a web-tool, or included as downloadable software linked to the publication.

Risk of bias
The risk of bias within studies was assessed independently by one reviewer (F.S.) using the 'Prediction model Risk of Bias Assessment Tool' (PROBAST) in a two-step process [11]. Initially, a short form version of PROBAST was used to quickly assess studies following a simplified assessment. While the simplified form is less detailed, it has perfect sensitivity in recognizing articles with high risk of bias [12]. The articles not marked high risk of bias during the initial step were then assessed using the full PROBAST form on all reported models. The bias assessment was subsequently validated by a second reviewer (F.L.). Differences in opinion were resolved through discussion with the third reviewer (I.S.).
The initial step using the short form PROBAST was motivated by its time-saving potential, while still ensuring a perfect true positive rate for studies and models marked high risk of bias. Where the original PROBAST form had unspecified numerical criteria, we used the thresholds suggested in the short form version [12]. Specifically, these were:
• The sufficient events-per-variable (EPV) ratio was set to 20 when candidate predictors could be identified, or 40 for the final predictors of the model, if not.
• Deletion of data due to missing covariates was unacceptable if more than 5% of included participants were removed, calculated after appropriate exclusions had been applied.
• Univariable rather than multivariable predictor selection, or lack of optimism assessment, could be ignored if the EPV was above 25 for candidate predictors, or 50 for final-model predictors if candidate predictors were not detailed.
Note that in the case of an external validation of a model, EPV was considered sufficient if the minority outcome had more than 100 events, as described in PROBAST [11].
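To make the EPV arithmetic concrete, the following is a hypothetical helper (not part of our review workflow) applying the thresholds above; the function name and inputs are illustrative.

```r
# Hypothetical check of the short form EPV criterion described above.
# EPV = number of events divided by the number of predictors considered.
epv_ok <- function(events, n_predictors, candidate_known = TRUE) {
  epv <- events / n_predictors
  threshold <- if (candidate_known) 20 else 40  # thresholds from the short form [12]
  epv >= threshold
}

epv_ok(events = 300, n_predictors = 12)                          # EPV = 25 -> TRUE
epv_ok(events = 300, n_predictors = 12, candidate_known = FALSE) # threshold 40 -> FALSE
```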

Analyses
We used descriptive statistics to summarize our findings. Within each included article, we identified all applications of risk models for incident hypertension and detailed the method and setup used in model development and validation. Bias in studies was assessed to describe the trustworthiness of model results.
Considering earlier reviews on the topic, we anticipated that the Area Under the Receiver Operating Characteristic curve, abbreviated AUROC or just AUC, and the concordance statistic, abbreviated C-statistic, would be the most reported performance measures [6-8]. These are discriminatory measures and are equivalent in the binary outcome setting when information on event times is not used [13]. As for calibration measures, the Hosmer-Lemeshow statistic was expected to be the most widely reported [14]. Meta-analyses of reported AUC/C-statistic measures were fitted separately for traditional models and ML models, as done in an earlier review [8]. Calibration was not used in meta-analyses due to incompleteness and variation in reporting.
Further, based on earlier reviews, we assumed some models had been externally validated by independent researchers. We performed meta-analyses for the Framingham risk model to assess the expected performance and heterogeneity in a situation where variation in model development was not relevant. No other model had been externally validated to an extent that would allow a separate analysis.
Meta-analyses and regressions were calculated using the metafor package in R [15]. Articles in which risk models for hypertension are developed or externally validated commonly report multiple results, e.g., to test various aspects of model development or the use of different datasets. In the context of a meta-analysis or regression, this is a problem due to the possible correlation between results reported on the same datasets. Meta-analyses and regressions can accommodate this interdependency if the exact covariances or correlations between results are given. However, estimates of within-study covariance or correlation are often not reported in the literature [9], meaning another approach must be used. Naïve inclusion of all results, i.e., assuming zero within-study correlation, would overemphasize the importance of the studies that reported the most results. Aggregating results per study or randomly selecting a single result per study has been proposed as an alternative, but would imply a loss of statistical information and is not considered ideal [16]. We opted to address this issue by selecting a subset of the results found in included studies, as well as applying a three-level meta-analysis model that can account for some of the described correlation.
We selected the subset of results by the following considerations:
• For nested models, we did not include results for the reduced models.
• When several model performance measures for different validation procedures were reported for a new risk model, only one score was included, prioritized (from high to low) by how it was calculated: bootstrapping/cross-validation, then test dataset results, then development data.
• Models with only one or two predictors were excluded unless derived as such during model development.
• Complete information on the AUC/C-statistic standard error, or the information needed to approximate it, had to be reported.
• Where discrete models like nomograms or risk scores were derived from continuous models, we included only results from the continuous model if available.
These considerations were applied per modelling method, gender, and the mean/median follow-up time reported in each article.
In using a three-level model in our meta-analyses, we allow heterogeneity to be estimated at two levels as opposed to just one. We grouped results within the studies in which they were reported, modelling heterogeneity at the study level as well as at the level of individual results. In practice, if results were more similar within each study than across studies, most heterogeneity would be estimated at the study level. Although a three-level model is not a perfect reflection of the true correlation structure, it will likely produce a model closer to reality than a two-level model ignoring within-study correlation [17]. To evaluate the three-level model compared to a two-level model, we performed a likelihood ratio test.
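As a minimal sketch of this setup in metafor [15], assume a data frame `dat` with one row per reported result: `yi` (the logit-transformed AUC/C-statistic), `vi` (its sampling variance), `study` (study identifier), and `result` (result identifier). The column names are illustrative, not those of our actual scripts.

```r
library(metafor)

# Three-level model: individual results nested within studies, with random
# effects at both the study level and the level of individual results.
fit3 <- rma.mv(yi, vi, random = ~ 1 | study/result, data = dat, method = "REML")

# Two-level comparison model that ignores the study grouping.
fit2 <- rma.mv(yi, vi, random = ~ 1 | result, data = dat, method = "REML")

# Likelihood ratio test of the three-level versus the two-level model.
anova(fit3, fit2)

# Intra-class correlation: share of total heterogeneity at the study level.
fit3$sigma2[1] / sum(fit3$sigma2)

# Pooled estimate with 95% confidence and prediction intervals,
# back-transformed from the logit scale.
predict(fit3, transf = transf.ilogit)
```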
The three-level model does not fully solve the issue of sample-error correlations not being reported in studies. Rather, it produces an approximation assuming all covariances between individual results within the same study are equal [16]. To account for this, we set missing sample-error correlations to zero and subsequently applied the ClubSandwich estimator for three-level models. This estimator is robust to the slight model misspecification that arises from ignoring sampling errors [17].
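Continuing the sketch above, this step can be expressed with the clubSandwich package: a block-diagonal sampling-variance matrix is built with within-study sampling-error correlations set to zero, after which cluster-robust (CR2) standard errors are applied.

```r
library(clubSandwich)

# Block-diagonal sampling-variance matrix assuming zero within-study
# sampling-error correlation (r = 0).
V0 <- impute_covariance_matrix(vi = dat$vi, cluster = dat$study, r = 0)

fit3 <- rma.mv(yi, V = V0, random = ~ 1 | study/result,
               data = dat, method = "REML")

# Cluster-robust ("ClubSandwich", CR2) tests and confidence intervals,
# robust to misspecification from the ignored sampling-error correlations.
coef_test(fit3, vcov = "CR2")
conf_int(fit3, vcov = "CR2")
```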
We performed a sensitivity analysis to assess the possible impact of ignoring sample-error correlation: we sampled random sample-error correlations for the results within each study and reran our meta-analysis, repeating the procedure a total of 1000 times. This provided distributions of relevant parameter estimates, such as heterogeneity and mean effect, which allowed us to estimate how much these parameters could have been affected had the correct sample-error correlations been known.
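A condensed sketch of this procedure is shown below; for simplicity it draws a single correlation per run and applies it to all studies, which is an assumption of the sketch rather than a description of our exact implementation (see S1 Appendix for details).

```r
set.seed(1)
sims <- replicate(1000, {
  r  <- runif(1, min = 0, max = 0.9)   # randomly drawn sample-error correlation
  Vr <- impute_covariance_matrix(vi = dat$vi, cluster = dat$study, r = r)
  m  <- rma.mv(yi, V = Vr, random = ~ 1 | study/result,
               data = dat, method = "REML")
  c(mu = as.numeric(m$beta), tau2 = sum(m$sigma2))
})

# Distribution of the pooled effect and total heterogeneity across runs.
apply(sims, 1, quantile, probs = c(0.025, 0.5, 0.975))
```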
In the third analysis, we applied a three-level meta-analysis to the external validations of the Framingham risk model. For studies reporting multiple validation results, we fixed the intra-class correlation to that estimated by the meta-analysis of the traditional models. We excluded results where diabetics were included in the data, as the original model was not developed for individuals with diabetes [18].
Moderators were included in meta-regressions to assess whether they could explain some of the heterogeneity. We included the following moderators: region (Americas, Europe, Asia), mean/median age of the cohort at baseline, median/mean follow-up time of the study, number of participants in the study, and the incidence rate in the data. Gender (men, women, all) was included for traditional models, but not for ML models due to its homogeneous distribution. Mean/median blood pressure of the cohort at baseline was considered but was not reported sufficiently for inclusion. Each moderator was tested as a single moderator. In addition, one analysis was performed with all moderators included simultaneously, as sketched below. P-values below 0.05 were considered significant. All meta-analyses and meta-regressions were calculated using the REstricted Maximum Likelihood (REML) estimator.
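In metafor, each meta-regression is the three-level model above extended with a `mods` term; the moderator column names below (`age`, `fu`, `n_part`, `inc`, `region`) are illustrative.

```r
# Single-moderator meta-regression, e.g., mean/median baseline age.
mr_age <- rma.mv(yi, V = V0, mods = ~ age,
                 random = ~ 1 | study/result, data = dat, method = "REML")

# All moderators included simultaneously.
mr_all <- rma.mv(yi, V = V0, mods = ~ age + fu + n_part + inc + region,
                 random = ~ 1 | study/result, data = dat, method = "REML")

# Robust (CR2) tests of the moderator coefficients.
coef_test(mr_all, vcov = "CR2")
```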
For all analyses, the AUC/C-statistic was transformed using the logit transformation. Standard errors on the logit scale were approximated from the reported values or, where these were not reported, fully estimated using the equations provided by Debray et al. [9]. Studies with insufficient information on sampling error, or other data needed to estimate standard errors, were left out of the meta-analyses.
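For completeness, the `yi` and `vi` columns used in the sketches above can be constructed along these lines; the delta-method approximation of the standard error is standard, and the confidence-interval fallback assumes a symmetric 95% interval, in the spirit of Debray et al. [9].

```r
# Logit transform of the AUC/C-statistic.
logit <- function(p) log(p / (1 - p))

# Delta-method approximation: SE(logit(AUC)) ~= SE(AUC) / (AUC * (1 - AUC)).
se_logit_from_se <- function(auc, se) se / (auc * (1 - auc))

# Fallback when only a 95% confidence interval is reported.
se_logit_from_ci <- function(ci_lb, ci_ub) {
  (logit(ci_ub) - logit(ci_lb)) / (2 * qnorm(0.975))
}

dat$yi <- logit(dat$auc)
dat$vi <- se_logit_from_se(dat$auc, dat$se_auc)^2
```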
We did not assess publication bias by any statistical tests or funnel plot asymmetry. Analyses were calculated using the R language (version 4.3) and the RStudio IDE, with the tidyverse libraries used to handle data and metafor to perform meta-analyses and meta-regressions [15,19-21]. Plots and figures were created using ggplot2 and ggh4x [22,23]. All code and data required to reproduce the results and associated figures are provided in S1 and S2 Files.

Study details
A PRISMA flow diagram detailing the search process can be seen in Fig 1. From an initial pool of 5090 unique records provided by our search terms, we found 46 eligible records, with seven more discovered in citation analysis. In total, 53 articles were included in the review [18, …]. The cohort origin of studies is detailed in Table 1, with selected summary statistics presented in Table 2. We did not find any impact studies of the included risk models.
Among the included articles, cohorts from Asia were utilized the most, followed by North America and Europe. Only one article used a cohort from South America, and none were found for any African or Oceanian populations. The number of individuals included in cohorts had a median of 6454 and ranged from 297 participants to hospital electronic medical record (EMR) datasets containing more than 823,000 individuals. All included studies had a follow-up period of one year or more, with a median of five years; Fava et al. was an extreme outlier with a median 23-year follow-up [31]. Lastly, the number of model results reported per study varied from 1 to 22, with a median of 5.
The definition of hypertension used in studies was mostly consistent with ESC/ESH guidelines: 39 studies used systolic blood pressure above 140 mmHg, diastolic blood pressure above 90 mmHg, or the use of medication related to managing elevated blood pressure to define hypertension [2]. Use of medication or any existing diagnosis of hypertension was often reported by participants themselves. Ten studies relied on existing diagnosis codes, such as ICD-10, in EMRs, or a diagnosis predetermined by medical professionals. In a single study, an annual questionnaire was used to determine the presence of hypertension. Two studies did not provide details on how hypertension was defined. One study used systolic blood pressure above 130 mmHg, diastolic blood pressure above 80 mmHg, or the use of medication related to managing elevated blood pressure, in line with the recommendations of the American College of Cardiology (ACC) and the American Heart Association (AHA) [76].

Modelling methods
Of the 53 included studies, 50 developed a new risk model. Traditional algorithms were used in 44 studies and machine learning in 14 studies; hence, eight studies developed at least one model using a method from each group of algorithms.
Studies developing traditional models mostly used the same methods: for cohorts with varying follow-up times, Weibull or Cox regression was used, particularly in more recent years; otherwise, logistic regression was the most common choice. Twelve of the 44 studies presented a risk score table or nomogram derived from a fitted parametric model, primarily with the intention of simplifying the model for clinical use.
In terms of machine learning, a total of 13 different algorithms had been used to develop new risk models. The most popular were random forest, artificial neural network, XGBoost, and support vector machine algorithms. An overview of methods with summarized results from each article is given in Table 3.

Performance measures
The most frequently reported performance measure for discrimination was the AUC or the C-statistic. Calibration measures were reported less consistently and with more variation in method, with the Hosmer-Lemeshow statistic being the most popular. Other notable methods were graphical assessment of calibration plots, curves, or distributions, reporting the predicted-observed ratio, or other statistical tests such as the Greenwood-Nam-D'Agostino test.
While no impact studies were found, four studies reported measures for assessing the clinical implications of their model; in these, the net benefit of the developed model was assessed and compared to alternatives using decision curves [59,63,75,77]. Some articles reported measures for comparing models, such as the Net Reclassification Index. A few articles included additional performance measures, like accuracy, sensitivity, the Brier score, and others. A recurring issue for measures requiring a defined risk threshold for prognosis was that the threshold was not given. Due to the lack of consistent reporting on these measures, these results were not recorded.

Validation of models
We observed a large degree of variability in the validation performed during model development: 11 studies (22%) reported a form of cross-validation, bootstrapping, or other repeated sampling procedure in validating their results. A test set was the primary validation method in 23 articles (46%), while 16 studies (32%) reported only results from the data the model was derived from. Additionally, four studies used external datasets to validate their results within the development study, shown in Table 4.

Externally validated models
Only six risk models were found to be externally validated in subsequent publications on new populations. The Framingham risk model was validated the most, by 16 studies with large regional diversity [18]. The validation AUC/C-statistic ranged from 0.537 to 0.840, while the AUC/C-statistic reported in the development study was 0.788. Calibration varied from acceptable in some cohorts to severe mis-calibration in others. The second most validated model was the KoGES model, validated in three external studies [33]. The remaining examples each had only one external validation by independent researchers, see Table 4.

Variables
The most used variables in studies were age, systolic blood pressure, diastolic blood pressure, Body Mass Index (BMI), smoking, sex, and family history of hypertension. The top eight most used variables in the literature were the same as those used in the Framingham risk model [18]. Note that in five studies applying ML methods, complete information about the variables used in the final models was not presented [45,48,50,53,70]. A summarized view of predictors used in studies can be seen in Fig 2.
Use of genetic information. In total, we found 10 studies investigating the efficacy of using detailed genetic information, such as genetic risk scores, to improve risk prediction for incident hypertension [31,34,36-38,44,55,58,63,66,73]. In almost all cases, the resulting models' AUC/C-statistic was only slightly higher than, if not equal to, that of a model without genetic information. Most applied traditional regression-based models, which were limited to capturing linear effects. In one study, the inclusion of a genetic or polygenic risk score (GRS/PRS) only improved modelling results for the ML models, which indicates a non-linear effect of the GRS [63].

Quality assessment of included studies
In total, four studies had results assessed as low risk of bias, all applying or validating traditional models. In the initial step using the short form PROBAST, 41 studies were assessed to have only results with high risk of bias, see Fig 3. The predominant remarks were improper handling or documentation of missing data in participants, too low events-per-variable (EPV) ratios, or lacking or improper optimism assessment of results. In the subsequent step, five of the remaining 12 studies were assessed as high risk of bias using the full PROBAST form, mainly due to lacking or insufficient calibration assessment, see Fig 4.

Meta-analyses
Due to the low completeness of reporting and the lack of consistency in how calibration was assessed, we opted to use only the AUC/C-statistic performance measures in the meta-analyses.

Fig 3. See [12] for more details. Domains 3, 5 and 6 were not applicable for external validations. Six studies had remarks that were only valid for some of the reported results, e.g., due to the events-per-variable (EPV) criteria being less strict for external validations or different methods being used on some of the developed models; these were marked with a mixed "High/Low" symbol on the relevant domain or overall assessment. 'Dev.': developed models, 'Ext. val.': external validations, 'NA': not applicable.
Traditional models. Of the 44 studies developing models using traditional methods, 39 had sufficient information for meta-analysis. From these, a total of 46 results were included: five studies contributed two results each and one study contributed three. Only two of the 39 studies were assessed as having low risk of bias for model development, with another being unclear. Model fit was significantly improved using a three-level model compared to a two-level model (P = 0.0119).
The estimated mean AUC/C-statistic was 0.779 (95% CI: 0.762-0.795). Heterogeneity was high (I²: >99%, Cochran's Q = 7325, P < 0.0001). Hence, the 95% prediction interval for new risk models was wide, at 0.660-0.865. The intra-class correlation for each data source was estimated at 90.4%, indicating that variation was largely due to differences between studies. The contribution of each study to the pooled estimates had a median of 2.6%, all within 1.98% to 2.83%, meaning that the influence of individual studies on our estimates was relatively evenly distributed. A forest plot is provided in Fig 5.
When including moderators, the mean/median age of the cohort at baseline (P = 0.0246), time from baseline to outcome determination (P = 0.0009), and outcome rate (P = 0.001) were found significant in univariable assessment. Only age remained significant (P = 0.0271) when all moderators were included simultaneously, while heterogeneity persisted (I²: 97%).
Machine-learning models. Of the 14 studies applying ML to develop risk models, 13 had sufficient information for meta-analysis. From these, a total of 53 results were included. None of the studies applying ML were assessed to have low risk of bias. The three-level model significantly improved model fit compared to a two-level model (P < 0.0001).
The estimated mean performance was 0.817 (95% CI: 0.767-0.858). Heterogeneity was high (I²: >99%, Cochran's Q = 22001, P < 0.0001). The 95% prediction interval for new models was 0.547-0.943. The intra-class correlation for each data source was estimated at 60.4%, indicating that variation due to differences between studies was moderate. The contribution of each study to the pooled estimates had a median of 8%, all within 5.6% to 8.95%. Hence, the influence of each individual study was relatively even, despite a large variation in the number of results per study included in the analysis.
In meta-regressions, time from baseline to outcome determination was found significant (P = 0.0283) in univariable assessment. Due to missing entries, only the number of participants, time between baseline and outcome determination, and region were included simultaneously, where time between baseline and outcome determination was again significant (P = 0.0405). High heterogeneity persisted (I²: >99%). A forest plot is provided in Fig 6.
Sensitivity analysis of ignoring sample-error correlation. None of the included studies reporting multiple results reported the sample-error correlation. However, the sensitivity analysis of ignoring sample-error correlation concluded that the estimates were mostly unaffected. We suggest that this is likely due to the large scale of heterogeneity relative to the possible sample-error covariance that was ignored. We refer to S1 Appendix for more details.

External validations of the Framingham risk model. From the external validations, we selected the 19 external validation results for the Framingham risk model found in 16 different studies. Of these, only three were assessed as low risk of bias, with another assessed as unclear risk. Three studies included two results each. In Völzke et al. [34], the two results were derived from unrelated cohorts, and we considered these independent. For the other two, we fixed the intra-class correlation to 0.904, as estimated for the traditional regression-based models, and applied a three-level model. We estimated the mean performance at 0.761 (95% CI: 0.722-0.795) with high heterogeneity (I²: >99%, Cochran's Q = 4268, P < 0.0001), and subsequently the 95% prediction interval as 0.571-0.883. The cohorts' mean/median age at baseline was found significant (P = 0.013) in univariable assessment, without affecting heterogeneity much (I²: >99%). Simultaneous inclusion of moderators was not performed due to the relatively low number of results available for the meta-regression. A forest plot is provided in Fig 7.

Discussion
Many risk models for incident hypertension have been developed in recent years, with half of the articles included in this review published in 2018 or later. Concurrent with the substantial increase in the number of relevant articles, a large variation was found in how study cohorts were organized, which variables were used in modelling, and which methods were used for model development. Notably, while 15 different countries were represented in study cohorts, only one study used a South American population, and no African or Oceanian populations had been used at all. The inclusion of genetic information along with clinical information was seen in multiple studies, yet displayed little comparative improvement over models without it [31,34,36-38,44,55,58,66]. A single exception was found where ML models improved with the introduction of genetic information, but the traditional model did not [63]. This might suggest that non-linear modelling should be considered for capturing the predictive information presented by genetic data. While Völzke et al. [34] considered genetic information in the form of individual single nucleotide polymorphisms (SNPs) in Bayesian networks, there was no direct comparison against a model without genetic information. Overall, the included genetic information varied from individual SNPs to full genetic or risk scores for individuals or groups.
We found that only a small proportion of studies were assessed to have low risk of bias results. Improper deletion of individuals with missing data, lack of optimism assessment, and improper or missing reporting of relevant performance measures were the identified culprits in most articles. Extra care should go into interpreting these reported results, as they may be over-confident, and performance might not be as expected when the model is applied to a new cohort. Notably, most studies assessed as having high risk of bias had similar issues in their study methodology, as identified by the simplified PROBAST form. To improve reporting and study quality, the Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines and the PROBAST assessment criteria themselves may be useful [11,12,79,80].
In the meta-analyses, the pooled effect for ML models was higher than for traditional models. However, while the proportion of heterogeneity was similar for both modelling types (I²: >99%), the scale was far higher for ML models. As such, the 95% prediction interval for new risk models using ML was wider, limiting its usefulness. Traditional models had a lower mean effect and a narrower prediction interval, suggesting them as a more conservative approach. However, we note that the scale of heterogeneity is likely to also be affected by our selection of model results. While only 13 of the 47 studies included in meta-analyses used machine learning models, we included 53 results from these, far more results per study than the 46 results from the 39 studies using traditional regression-based models.
The few studies applying machine learning may partly explain why only one moderator was found significant in explaining heterogeneity. As moderators largely described study characteristics, the actual number of unique datapoints for these was 13, i.e., the number of studies applying ML. Using the three-level model, the repeated moderator values were accounted for to some extent, as shown by the more even distribution of the influence of each study on the pooled estimates.
For studies applying traditional models, time from baseline to outcome determination, outcome rate, and the median/mean age of the cohort at baseline were found significant when included as individual moderators, with only baseline age remaining significant upon including all. Baseline age was also significant as a moderator for the Framingham external validations.
Studies with younger individuals in their cohorts reported better results. Increased age is a known risk factor for developing hypertension; hence it is closely connected to both outcome rate and follow-up time: older individuals will have a higher outcome rate, and time from baseline to outcome determination is simply the age-delta of the cohort from baseline to endpoint [2]. Nevertheless, the inclusion of moderators barely explained any heterogeneity in any case, reducing I² by less than 2%. While other known risk factors of hypertension could be relevant as moderators, e.g., baseline blood pressure, their reporting was inadequate for testing without excluding large parts of the included results.
Using the three-level model, analyses estimated considerable correlation within studies for both traditional and ML models. This similarity within studies suggests that within-study comparisons against meaningful alternatives are needed for judging the effect of various modelling choices. Hence, the effect of using a machine learning model, the utility of different data sources, or the inclusion of a new sub-group of individuals can only be assessed meaningfully when compared against alternatives within the same study.
As we meta-analyzed the external validations of the Framingham risk model, we could investigate a case independent of variation in model development methods. Even so, heterogeneity was estimated to account for more than 99% of the variation in results, with baseline age as the only significant moderator. As model development was not relevant, the persistent high heterogeneity suggests that it was more related to other aspects, e.g., cohort characteristics or the recording of data. While noting that only a single model was considered, this underlines that confidence intervals of results presented in studies should only be considered relevant within the context of the study in which they are reported. To exemplify, the external validation AUC/C-statistic reported for the Framingham risk model ranged from 0.537 to 0.84, a far larger variation than suggested by the bootstrapped optimism of 0.0003 or the 95% confidence interval of 0.733 to 0.803 reported in its development article [18].
In 13 of the 16 studies that externally validated a model, a new risk model was presented as well. External validation of a risk model can go beyond simple application of the model in a new population; several methods may be tested to see if an external model can be made effective in a new population with relatively little effort, e.g., by recalibration or re-estimation of coefficients [81]. The advantage of a thorough external validation of existing models when a new risk model is proposed is two-fold: failing to obtain favorable results using external models would argue for the creation of a new risk model, while a thorough development of a new risk model will likely produce a best-case performance that can serve as context for the external validation. Lastly, we note that four of the included studies validated multiple models [30,43,69,74].
External validations are useful for testing performance outside of the development cohort and require that risk models are made available. Only four studies applying machine learning had made one of their final models readily available for external researchers, likely partly explaining why no machine learning models were found to be externally validated by independent researchers. The four ML models that were available had low complexity, allowing a full graphical presentation, e.g., a Bayesian network or a decision tree [34,42,45,52]. No models were found to be shared via online resources. Traditional models are easier to share without resorting to online resources. Several studies specifically emphasized application in clinical practice as a motivation and presented simplified versions of their risk model, in the form of nomograms, risk scores, or decision rules, for easier use by clinicians.
Another aspect that challenges external validation and reproducibility is the increasing use of datasets derived from electronic medical records (EMRs). These often exceed traditional study cohorts in both the number of participants and the amount of clinical information, increasing the information load underlying any risk model development. This suggests that reporting should be even more rigorous. As an example of the opposite, three studies [50,53,70] developing risk models using EMRs did not report a complete list of the variables used in the final models.
Most included studies only reported discrimination performance measures, especially studies developing ML models. Neglecting other performance measures, such as calibration or clinical impact, is common, although discrimination ultimately provides only a partial view of a model's total performance [6-8,82,83].

Limitations of our study
We note that we have included fewer studies than earlier, relevant reviews. While our inclusion criteria were more restrictive, there can be variation in distinguishing prediction model development from association studies due to similarities in how models are developed, reported, and assessed. We excluded several studies with prognostic models where the focus was fixed on one or a specific set of similar variables, as none of these followed recommended procedures for creating risk models. Further, they were often exclusively focused on their specific research niche, implying that considerably more effort would be needed to identify all such studies. Most included studies using genetic information were edge cases in this sense but were included as they were explicitly labeled as risk scores or risk models for incident hypertension.
Applying the original PROBAST framework to studies developing ML models may be ill-advised. However, we deemed it relevant as all models were developed for a similar purpose as the traditional models. With the publication of the PROBAST-AI framework, better assessment of risk models based on ML will likely be possible [65].
A significant limiting factor was that we only focused on the AUC/C-statistic as a performance measure. Both discrimination and calibration should be assessed simultaneously in meta-analysis to increase the power of the analysis [9]. Incomplete reporting, as well as variation in the methods used, meant we were unable to incorporate calibration into our meta-analyses [14,82].
Lastly, we did not assess publication bias of our included results, similar to an earlier, relevant review [8]. Assessing publication bias was not emphasized in a methodological guideline for systematic reviews on prediction models [9].

Conclusion
The increase in the number of articles and research effort relevant to hypertension risk modelling may produce insights on creating better models, highlight limitations of existing ones, and help determine how well risk may be predicted in different populations.
We found 53 studies focused on developing or validating a prognostic risk model for incident hypertension. There was rich diversity in cohort origin, methods applied, and subsequent results obtained. The quality of studies was found to be poor, with only a small minority assessed as low risk of bias using the PROBAST framework. Moreover, specific issues for the studies developing ML models were developed models not being made available and incomplete reporting of the input variables used.
We applied a three-level meta-analysis model to the reported AUC/C-statistics, as this was the only performance measure reported to a sufficient degree. Model discrimination was found to be acceptable to good in many cases, and seemingly higher for ML models than for traditional models. However, high heterogeneity was seen for both model groups, suggesting considerable variability in the performance of new models.
Only one model, the Framingham risk model, had been externally validated more than three times, and we found large heterogeneity in these external validations. This indicates that there is also large variability in how well models translate to new populations. Despite this, only 16 of the 53 included studies reported an external validation of an existing model.
Based on our findings, we have identified several items that can enable the research community to better assess hypertension risk models. Broader adherence to existing guidelines for reporting and developing risk models, like TRIPOD, and specifically reporting appropriate performance measures beyond discrimination, can help improve the quality of reporting. Further, we recommend a stronger focus on validation so that sources of improvement in risk modelling are identifiable and existing risk models are evaluated. This implies considering reasonable baseline models and performing external validations of existing models. To enable this, developed risk models and the information required for their practical use need to be made available to external researchers.

Fig 4. Applicability was mostly unclear due to the use of cohorts with only middle-aged or older individuals, e.g., all being 40 years of age or more.

Fig 2. Variables used in studies. Variables were counted as those used in any final developed model in a study. We summarize by studies due to variation in the number of developed models per study. Variables used by only a single study were either merged with similar ones or grouped as "Other" within their category. Note: variable information from five studies was excluded as they did not report complete information, meaning variable information from 45 studies developing new models is included here. 'BMI': Body Mass Index, 'BP': Blood Pressure, 'Chol.': Cholesterol, 'HDL': High-density lipoprotein, 'LDL': Low-density lipoprotein, 'Misc': Miscellaneous, 'SNPs': Single nucleotide polymorphisms. https://doi.org/10.1371/journal.pone.0294148.g002